
6 research outputs found

SQL injection attack detection in network flow data

Author: Campazas-Vega Adrián
Crespo-Martínez Ignacio Samuel
Fernández Llamas Camino
Guerrero Higueras Ángel Manuel
Riego Del Castillo Virginia
Álvarez Aparicio Claudia
Publication venue: Elsevier BV
Publication date: 23/01/2023

[EN] SQL injections rank in the OWASP Top 3. The literature shows that analyzing network datagrams allows for detecting or preventing such attacks. Unfortunately, such detection usually implies studying all packets flowing in a computer network. Therefore, routers in charge of routing significant traffic loads usually cannot apply the solutions proposed in the literature. This work demonstrates that detecting SQL injection attacks on flow data from lightweight protocols is possible. For this purpose, we gathered two datasets collecting flow data from several SQL injection attacks on the most popular database engines. After evaluating several machine learning-based algorithms, we obtained a detection rate of over 97% with a false alarm rate of less than 0.07% using a Logistic Regression-based model. Instituto Nacional de Ciberseguridad de España (INCIBE); Universidad de Leó
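The abstract above reports a detection rate and a false alarm rate without defining them. A minimal sketch follows, assuming the usual definitions (detection rate as the true positive rate, false alarm rate as the false positive rate); the function name and the counts are hypothetical and are not taken from the paper.

```python
def detection_metrics(tp, fp, tn, fn):
    """Return (detection_rate, false_alarm_rate) from confusion-matrix counts.

    Assumed definitions (not stated in the abstract):
      detection rate   = TP / (TP + FN)
      false alarm rate = FP / (FP + TN)
    """
    detection_rate = tp / (tp + fn) if (tp + fn) else 0.0
    false_alarm_rate = fp / (fp + tn) if (fp + tn) else 0.0
    return detection_rate, false_alarm_rate


# Hypothetical counts, chosen only to illustrate the arithmetic.
dr, far = detection_metrics(tp=97, fp=1, tn=1999, fn=3)
print(f"detection rate = {dr:.2%}, false alarm rate = {far:.2%}")
```

Under these assumed definitions, a low false alarm rate matters because benign flows usually far outnumber attack flows, so even a small rate can mean many alerts.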

Leon University (Spain)

Analyzing the influence of the sampling rate in the detection of malicious traffic on flow data

Author: Campazas-Vega Adrián
Crespo-Martínez Ignacio Samuel
Fernández Llamas Camino
Guerrero Higueras Ángel Manuel
Matellán Olivera Vicente
Álvarez Aparicio Claudia
Publication venue: Elsevier
Publication date: 18/08/2023

[EN] Cyberattacks are a growing concern for companies and public administrations. The literature shows that analyzing network-layer traffic can detect intrusion attempts. However, such detection usually implies studying every datagram in a computer network. Therefore, routers routing a significant volume of network traffic do not perform an in-depth analysis of every packet. Instead, they analyze traffic patterns based on network flows. However, even gathering and analyzing flow data has a high computational cost, and therefore routers usually apply a sampling rate to generate flow data. Adjusting the sampling rate is a tricky problem. If the sampling rate is low, much information is lost and some cyberattacks may be missed, but if the sampling rate is high, routers cannot deal with it. This paper tries to characterize the influence of this parameter on different detection methods based on machine learning. To do so, we trained and tested malicious-traffic detection models using synthetic flow data gathered with several sampling rates. Then, we double-checked these models with flow data from the public BoT-IoT dataset and with actual flow data collected on RedCAYLE, the Castilla y León regional academic network.
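The effect of sampling on flow data can be illustrated with a small sketch. It assumes deterministic 1-in-n sampling (real routers may sample randomly) and groups packets into flows by the usual 5-tuple. The `Packet` type, the function names, and the toy traffic are hypothetical and are not taken from the paper.

```python
from collections import Counter
from typing import NamedTuple


class Packet(NamedTuple):
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    proto: str


def sample_one_in_n(packets, n):
    """Deterministic 1-in-n sampling: keep packets at indices 0, n, 2n, ..."""
    return [p for i, p in enumerate(packets) if i % n == 0]


def build_flows(packets):
    """Group packets by their 5-tuple and count packets per flow."""
    return Counter(packets)


# Toy traffic: a port scan (one packet per port) followed by one long benign flow.
scan = [Packet("10.0.0.9", "10.0.0.1", 40000, port, "tcp") for port in range(1, 1001)]
benign = [Packet("10.0.0.5", "10.0.0.1", 50000, 443, "tcp")] * 9000
packets = scan + benign

full_flows = build_flows(packets)
sampled_flows = build_flows(sample_one_in_n(packets, 500))
print(len(full_flows), len(sampled_flows))
```

In this toy trace, most of the single-packet scan flows vanish after 1-in-500 sampling while the long benign flow survives, which is the kind of information loss the abstract describes.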

Leon University (Spain)

Detection of advanced persistent threats in communication networks using flow data

Author: Campazas-Vega Adrián
Publication date: 12/07/2023

131 p. [Translated from ES] Advanced Persistent Threats (APTs) are among the most worrying threats for governments, organizations, and companies. One of the main characteristics of an APT is the generation of malicious traffic at various stages of its life cycle. The literature has shown that malicious traffic can be detected using machine learning models previously trained on network packets. Network packets contain all the information exchanged in a network communication, including the payload. Some networks handle so much traffic that it is not possible to analyze every packet their routers manage. Such infrastructures are forced to use flow-based protocols to analyze what is happening on the network. A flow consists of a set of IP packets passing through an observation point in the network during a certain time interval. All packets belonging to the same flow share certain characteristics, such as the source and destination IP addresses and ports. Network flows do not store the packet payload, which reduces the computational load on routers but also loses much of the information contained in those packets. Even with flow-based protocols, some networks handle so much traffic that, to reduce the computational load on their devices, they need to select one packet out of every X when generating network flows. This process is known as sampling. This work aims to detect malicious network traffic, such as that generated by an APT, in this type of infrastructure, increasing the security of companies, organizations, and users. To that end, different machine learning-based techniques were analyzed.
To train machine learning models, correctly labeled datasets are needed. To generate datasets containing network flows collected with different sampling thresholds, the Docker-based framework for gathering netflow data (DOROTHEA) tool was developed and validated as an implementation of a previously proposed framework. DOROTHEA is a flexible and scalable tool that generates isolated traffic, either malicious or benign, so that the generated network flows can be labeled unambiguously. To check whether malicious traffic can be detected in networks that use flow-based protocols with packet sampling, two different approaches were applied. On the one hand, supervised learning algorithms were trained; this first approach also aims to analyze how the sampling threshold affects malicious-traffic detection. In the second approach, anomaly-detection-based models were used. In the first, supervised approach, datasets containing port scanning attacks were generated with different sampling thresholds, specifically 1/250, 1/500, 1/1,000, and 1/10,000. These datasets, collected with DOROTHEA, were used to train and evaluate the K-Nearest Neighbors (KNN), Logistic Regression (LR), Linear Support Vector Classification (LSVC), LSVC+Stochastic Gradient Descent (SGD), Multilayer Perceptron (MLP), and Random Forest (RF) models. To check their generalization ability, these models were evaluated with network flows collected on the production routers of RedCAYLE, the regional academic network of Castilla y León, and with the public BoT-IoT dataset. Both datasets contained network flows collected with a sampling of 1 packet out of every 1,000.
The results showed that malicious traffic can be detected in sampled network flows using machine learning-based detection models. However, the results change significantly depending on the sampling rate: as the sampling threshold increases, some models lose their detection capability. Nevertheless, the KNN, MLP, and RF models were shown to maintain their detection capability at all the thresholds studied, with KNN showing the best results. Subsequently, anomaly-detection-based models were built. These models are not trained to detect a specific type of attack but to identify legitimate traffic and to consider anomalous any traffic that deviates from the learned pattern. To check whether malicious traffic can be detected in networks handling large amounts of traffic, the One-class Support Vector Machine (OC-SVM) and Isolation Forest (iForest) models were evaluated using synthetic sampled flow data and real sampled flow data collected on RedCAYLE. The training datasets contained only benign traffic. The evaluation datasets contained port scanning attacks. In addition, the synthetic dataset also contained SQL injection attacks, included to check that this type of model can detect very different network attacks. The results showed that the OC-SVM model performed well at detecting network attacks as anomalies, both in synthetic traffic and in the flows collected on the RedCAYLE routers. These results suggest that anomaly-detection-based models may be able to detect unknown or even zero-day attacks.
From the experiments carried out, it can be concluded that malicious traffic can be detected in networks handling large amounts of traffic, increasing network security. This doctoral thesis opens up a range of possibilities for future work on detecting malicious traffic in large communication networks, serving as the starting point for further research to improve attack detection in such networks. [EN] APTs represent a significant concern for governments, organizations, and enterprises due to their persistent and stealthy nature. A key characteristic of APTs is the generation of malicious traffic at various stages of the attack lifecycle. Prior literature has demonstrated the feasibility of detecting malicious traffic by leveraging machine learning models that are trained on network packets. Network packets contain all the information that is exchanged in network communications, including the payload. However, in networks that handle an enormous amount of traffic, analyzing every single packet that routers manage may not be feasible. To address this challenge, lightweight flow-based protocols are employed to analyze network activity in such infrastructures. A flow is defined as a set of IP network-layer datagrams that traverse an observation point in the network during a specific time interval. All datagrams belonging to the same flow share certain characteristics, such as IP addresses and ports, both source and destination. Unlike network packets, network flows do not store the packet payload, thus reducing the computational load on routers. However, this approach also results in the loss of a significant portion of the information contained in the datagrams. Some networks manage so much traffic that, in order to reduce the computational load on their devices, they need to select one packet out of every X when generating network flows. This process is known as sampling.
The primary objective of this study is to detect malicious network traffic, including that which may be generated by APTs, in such infrastructures using machine learning techniques. This research aims to enhance the security of companies, organizations, and end-users by identifying and mitigating potential cyber threats through the analysis of network flows. In order to train machine learning models effectively, it is essential to have access to accurately labeled datasets. To this end, DOROTHEA has been developed and validated as an implementation of a previously proposed framework. The primary function of DOROTHEA is to generate datasets comprising network flows generated using various sampling thresholds. This tool is designed to be both flexible and scalable, and allows for the generation of both malicious and benign traffic flows in isolation, thereby enabling unambiguous labeling of the generated flows. The use of DOROTHEA facilitates the training of machine learning models with accurate and relevant data, ultimately leading to more effective detection of malicious network traffic. To evaluate the feasibility of detecting malicious traffic in networks utilizing flow-based protocols and packet sampling, two distinct methodologies have been employed. The first approach utilizes supervised learning algorithms to train models for detecting malicious traffic, with an aim to investigate the impact of varying sampling thresholds on the efficacy of the models. The second approach employs anomaly detection techniques to identify potentially malicious traffic patterns in network flows. By utilizing these two complementary approaches, this research aims to provide a comprehensive evaluation of the ability of machine learning techniques to detect malicious network traffic in complex, high-traffic infrastructures. In the first supervised learning approach, datasets comprising network flows collected under varying sampling thresholds were generated using DOROTHEA. 
These datasets contained instances of port scanning attacks and were generated using sampling thresholds of 1/250, 1/500, 1/1,000, and 1/10,000. The generated datasets were then used to train and evaluate the performance of several machine learning models, including KNN, LR, LSVC, LSVC+SGD, MLP, and RF. To evaluate the generalizability of these models, they were subsequently tested using a dataset collected from the production routers of RedCAYLE, which is the regional academic network of Castilla y León, as well as a publicly available dataset known as BoT-IoT. By testing the models using datasets from different sources, this research aims to demonstrate the ability of the supervised learning models to effectively detect malicious network traffic in diverse environments. The results of the study demonstrate that machine learning-based detection models can effectively identify malicious network traffic in sampled flow data. However, the performance of the models varies significantly depending on the sampling rate used. Specifically, as the sampling threshold increases, certain models experience a decrease in their detection capability. However, it was found that the KNN, MLP, and RF models were able to maintain their detection capability across all of the studied thresholds. Among these models, the KNN model demonstrated the most robust performance across all sampling thresholds. In order to investigate the ability of anomaly-detection-based models to identify malicious network traffic in high-traffic network infrastructures, two models, namely OC-SVM and iForest, were evaluated using synthetic sampled flow data and actual sampled flow data from the RedCAYLE network. These models were not trained to detect a specific type of attack, but rather to identify normal network traffic patterns and classify any traffic that deviated from those patterns as anomalous.
During the evaluation, the training datasets only contained benign traffic, while the test datasets included port scanning attacks. To ensure the robustness of the models, the synthetic dataset also included SQL injection attacks, which are significantly different from port scanning attacks. The results of the evaluation showed that both models were able to successfully identify anomalous traffic patterns in the test datasets, indicating that anomaly-detection-based models can be an effective tool for detecting malicious network traffic in high-traffic network infrastructures. The OC-SVM model demonstrated high accuracy in detecting network attacks as anomalies in both the synthetic traffic and flow data collected from RedCAYLE routers, indicating that such models based on anomaly detection can potentially detect unknown or zero-day attacks. In conclusion, the experiments conducted in this study demonstrate that it is possible to detect malicious traffic in networks that handle a large amount of traffic by using machine learning-based detection models and anomaly detection-based models. This capability allows potential victims to be warned of threats and increases the overall security of the network. This thesis opens up a number of possibilities for the future in relation to the detection of malicious traffic in large communication networks, being the starting point for further research to improve the ability to detect attacks in such networks.
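The supervised approach described above names KNN as the most robust model. A minimal sketch of k-nearest-neighbors classification on flow features follows, written with the standard library only. The two features (packets per flow, bytes per flow), the toy training rows, and the function name are hypothetical; the thesis abstract does not state which features or implementation were used, and a real pipeline would normally scale the features first.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Majority label among the k training rows closest to `query`.

    `train` is a list of (features, label) pairs; distance is Euclidean.
    """
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical labeled flows: (packets, bytes) -> label.
train = [
    ((1.0, 40.0), "scan"),
    ((1.0, 44.0), "scan"),
    ((2.0, 80.0), "scan"),
    ((120.0, 90000.0), "benign"),
    ((300.0, 250000.0), "benign"),
    ((80.0, 60000.0), "benign"),
]

print(knn_predict(train, (1.0, 60.0)))
print(knn_predict(train, (150.0, 100000.0)))
```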

Leon University (Spain)

Vision-Based Module for Herding with a Sheepdog Robot

Author: Campazas-Vega Adrián
Riego Del Castillo Virginia
Strisciuglio Nicola
Sánchez-González Lidia
Publication venue: MDPI AG
Publication date: 01/07/2022

Livestock farming is assisted more and more by technological solutions, such as robots. One of the main problems for shepherds is the control and care of livestock in areas that are difficult to access, where grazing animals are attacked by predators such as the Iberian wolf in the northwest of the Iberian Peninsula. In this paper, we propose a system to automatically generate benchmarks of animal images of different species from the iNaturalist API, which is coupled with a vision-based module that allows us to automatically detect predators and distinguish them from other animals. We tested multiple existing object detection models to determine the best one in terms of efficiency and speed, as it is conceived for real-time environments. YOLOv5m achieves the best performance as it can process 64 FPS, achieving an mAP (with IoU of 50%) of 99.49% for a dataset where wolves (predator) or dogs (prey) have to be detected and distinguished. This result meets the requirements of pasture-based livestock farms.
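The mAP figure above is reported at an IoU threshold of 50%. As a brief illustration of that criterion, the sketch below computes intersection over union for two axis-aligned boxes; the box format `(x_min, y_min, x_max, y_max)` and the example coordinates are assumptions made for this example, not details from the paper.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Width and height are clamped at zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


ground_truth = (0, 0, 10, 10)
close_prediction = (2, 0, 12, 10)   # shifted by 2 units
loose_prediction = (5, 0, 15, 10)   # shifted by 5 units

# A prediction counts as a match at the 50% threshold when IoU >= 0.5.
print(iou(ground_truth, close_prediction) >= 0.5)
print(iou(ground_truth, loose_prediction) >= 0.5)
```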

Directory of Open Access Journals

PubMed Central

University of Twente Research Information

Flow-Data Gathering Using NetFlow Sensors for Fitting Malicious-Traffic Detection Models

Author: Adrián Campazas-Vega
Camino Fernández-Llamas
Ignacio Samuel Crespo-Martínez
Ángel Manuel Guerrero-Higueras
Publication venue: MDPI AG
Publication date: 18/12/2020

Advanced persistent threats (APTs) are a growing concern in cybersecurity. Many companies and governments have reported incidents related to these threats. Throughout the life cycle of an APT, one of the most commonly used techniques for gaining access is network attacks. Tools based on machine learning are effective in detecting these attacks. However, researchers usually have problems with finding suitable datasets for fitting their models. The problem is even harder when flow data are required. In this paper, we describe a framework to gather flow datasets using a NetFlow sensor. We also present the Docker-based framework for gathering netflow data (DOROTHEA), a Docker-based solution implementing the above framework. This tool aims to easily generate taggable network traffic to build suitable datasets for fitting classification models. In order to demonstrate that datasets gathered with DOROTHEA can be used for fitting classification models for malicious-traffic detection, several models were built using the model evaluator (MoEv), a general-purpose tool for training machine-learning algorithms. After carrying out the experiments, four models obtained detection rates higher than 93%, thus demonstrating the validity of the datasets gathered with the tool.

Multidisciplinary Digital Publishing Institute

Malicious traffic detection on sampled network flow data with novelty-detection-based models

Author: Adrián Campazas-Vega
Camino Fernández-Llamas
Claudia Álvarez-Aparicio
Ignacio Samuel Crespo-Martínez
Vicente Matellán
Ángel Manuel Guerrero-Higueras
Publication venue: Nature Portfolio
Publication date: 01/09/2023

Cyber-attacks are a major problem for users, businesses, and institutions. Classical anomaly detection techniques can detect malicious traffic generated in a cyber-attack by analyzing individual network packets. However, routers that manage large traffic loads can only examine some packets. These devices often use lightweight flow-based protocols to collect network statistics. Analyzing flow data also allows for detecting malicious network traffic. But even gathering flow data has a high computational cost, so routers usually apply a sampling rate to generate flows. This sampling reduces the computational load on routers, but much information is lost. This work aims to demonstrate that malicious traffic can be detected even on flow data collected with a sampling rate of 1 out of 1,000 packets. To do so, we evaluate anomaly-detection-based models using synthetic sampled flow data and actual sampled flow data from RedCAYLE, the Castilla y León regional subnet of the Spanish academic and research network. The results presented show that detection of malicious traffic on sampled flow data is possible using novelty-detection-based models with a high accuracy score and a low false alarm rate.
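The novelty-detection idea in this abstract (fit on benign traffic only, then flag whatever deviates) can be sketched with a deliberately simple stand-in. The detector below uses a per-feature z-score threshold; it is not one of the models evaluated in the paper, and the class name, the features (packets per flow, bytes per packet), and the toy rows are all hypothetical.

```python
from statistics import mean, pstdev


class ZScoreNoveltyDetector:
    """Toy novelty detector: learns per-feature mean/std from benign rows only."""

    def __init__(self, threshold=3.0):
        self.threshold = threshold
        self.means = []
        self.stds = []

    def fit(self, benign_rows):
        columns = list(zip(*benign_rows))
        self.means = [mean(col) for col in columns]
        # Fall back to 1.0 when a feature is constant, to avoid division by zero.
        self.stds = [pstdev(col) or 1.0 for col in columns]
        return self

    def is_anomalous(self, row):
        """True if any feature lies more than `threshold` std devs from the benign mean."""
        return any(
            abs(x - m) / s > self.threshold
            for x, m, s in zip(row, self.means, self.stds)
        )


# Hypothetical benign flows: (packets per flow, bytes per packet).
benign = [(10, 500), (12, 520), (9, 480), (11, 510), (10, 490)]
detector = ZScoreNoveltyDetector().fit(benign)

print(detector.is_anomalous((10, 505)))  # close to the benign pattern
print(detector.is_anomalous((1, 40)))    # tiny single-packet flow, as in a scan
```

Because the detector never sees attack samples during fitting, it is not tied to a particular attack type, which is the property the abstract relies on.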

Directory of Open Access Journals